High-throughput technologies impress
us almost every week with novel
global results and big numbers. They of- 
ten reveal important general trends that 
are impossible to realize with classical, 
low-throughput experimental methods, 
yet (so far) they provide fewer insights 
into specific, molecular detail. Because 
of the amount of data involved, high- 
throughput technologies imply the use 
of bioinformatics methods that deal 
with information transformation, stor- 
age, and analysis. By necessity, most of 
these processes are automated. 

Partly because of the nature of cur- 
rent publication schemes, the accuracy 
and error margins of a given method are 
often only found in small print. It is ob- 
vious that each method has its limits 
and also that during data processing, 
some information will be lost or diluted. 
Because of the current need to integrate 
and add value to data, results from high- 
throughput experiments (if made pub- 
licly accessible) are often taken further 
by third-party research that relies on the 
quality of these data. Thus, I believe that 
public awareness of error margins for 
high-throughput experimental and 
computational methods should be in- 
creased; the incredibly valuable data ac- 
cumulating in various heterogeneous 
databases permit powerful analyses but 
should not be overinterpreted. In the 
following discussion, I will concentrate 
on limits in computational sequence 
analysis, which is far from being perfect 
(Table 1), despite the fact that sequenc- 
ing itself is highly automated and accu- 
rate, and despite the fact that sequence 
information is described in simple linear 
terms (using a four-letter alphabet). On 
average, a 70% accuracy just to predict
functional and structural features has to 
be considered a success (Table 1). 

Limitations in the Total Knowledge 
Base of Protein Function 

As these analysis methods are knowl- 
edge based, one of the reasons for the 
inaccuracy is that the quality of data in 
public sequence databases is still insuffi- 
cient (e.g., Bork and Bairoch 1996; Bha- 
tia et al. 1997; Pennisi 1999). This is par- 
ticularly true for data on protein func- 
tion. Protein function is loosely defined; 
cellular function is more than the very 
complicated network of individual mo- 
lecular interactions on which it is based 
(Bork et al. 1998). Furthermore, the se- 
mantics for functional features are not 
always established. For instance, the 
notion of a "protein complex" not only 
depends heavily on detection and puri- 
fication methods — which, in turn, are 
constantly evolving — but also on envi- 
ronmental conditions. Protein function 
is context dependent, and both molecu- 
lar and cellular aspects have to be con- 
sidered (for review, see Bork et al. 1998). 

Lactate dehydrogenase illustrates some
of this complexity: This gene product
can act both
as a dehydrogenase and an eye lens 
structural protein, depending on its con- 
text (for review, see Piatigorsky and 
Wistow 1991). Even without the compli- 
cation of a second, unrelated role for the 
same gene product, do we know enough 
about the function of lactate dehydroge- 
nase, one of the best-studied proteins? 
We know its biochemical pathway (at 
least in human and some model organisms),
its different isoenzymes (in organisms)
with different context-dependent
properties, its regulation, and the
organization of its quaternary structure.
However, we are probably still missing 
much information, even on crucial mo- 
lecular features: Are we sure about alter- 
native splice variants? Can we exclude 
age-dependent post-translational modi- 
fications in some tissues? Our knowl- 
edge is even more limited regarding 
higher order functions that involve con- 
centration, compartmental organiza- 
tion, dynamics, regulation, and perhaps 
even the impact of external environ- 
ment. Often, the available data give at 
best some reliable qualitative results on 
functional features but far from a com- 
plete understanding of functionality. 
Yet our ability to annotate genome se- 
quences and translate information 
therein relies heavily on the summaries 
of features attached to each sequence in 
the respective public databases. 

Limitations of Gene Expression 
Data Extrapolations 

As more high-throughput technologies 
follow, the data will become more com- 
plicated than sequences. Novel comple- 
mentary data types such as gene expres- 
sion arrays will generate more func- 
tional information, but conclusions 
from these data are often stretched with 
regard to protein products. The expression
levels of genes and of their corresponding
proteins seem to correlate only weakly, with
a correlation coefficient of 0.48 (Anderson
and Seilhamer 1997). Furthermore, recent
studies (Hanke et al. 1999; Mironov
et al. 1999) show that alternative splic- 
ing might affect >30% of the human 
genes, although measurements at the 
protein level have yet to confirm this. 
Finally, the number of known post- 



398 Genome Research 10:398-400 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.org 




Insight/Outlook 



Table 1. Selected Examples of Prediction Accuracy in Different Areas of Sequence Analysis

Prediction of                       Acc × cov (a)  Accuracy (%)  Coverage or coverage in % of reference set  Reference (b)
Human promoters                     0.35           50            70% of annotated test set                   Prestidge 1995; P. Bucher (pers. comm.)
Human regulatory RNA elements       0.34                         40% of new DNA                              Dandekar and Sharma (1998)
Human genes (only presence)         0.49           70            70% of chromosome 22                        Dunham et al. (1999) and refs. therein
Human SNPs by EST comparison        0.21           70            30% of all proteins with SNP                Buelow et al. (1999); Sunyaev et al. (2000)
Human alternative splicing          0.45           90            50% of all splice sites                     Hanke et al. (1999)
Transmembranes (only presence)      0.85           85            99% of annotated test set                   Tusnady and Simon (1998) and refs. therein
Signal peptides (only presence)     0.90           90            100% of annotated test set                  Nielsen et al. (1999)
GPI anchors (incl. cleavage site)   0.72           72            100% of annotated test set                  Eisenhaber et al. (1999)
Coiled coil (only presence)         0.81           90            90% of annotated coiled coil                Lupas (1996)
Secondary structure (three states)  0.77           77            100% of 3D test set                         Jones (1999) and refs. therein
Buried or exposed residues          0.74           74            100% of 3D test set                         Rost (1996)
Residue hydration                   0.72           72            100% of 3D test set                         Ehrlich et al. (1998)
Protein folds (in Mycoplasma)       0.49           98            50% of Mycoplasma ORFs                      Teichmann et al. (1999) and refs. therein
Homology (several methods)          0.49           98            50% of 3D test set                          Muller et al. (1999) and refs. therein
Functional features by homology     0.63           90            70% unicellular genomes                     Bork and Koonin (1998); Brenner (1999)
Function association by context     0.25           50            10% high confidence in yeast                Marcotte et al. (1999b)
Cellular localization (two states)  0.77           77            100% of annotated test set                  Andrade et al. (1998)

The numbers referred to are in many cases crude estimates taken or sometimes even estimated from the literature and have an expected accuracy of
~70%. Direct comparison of the numbers might be misleading as the context is not properly explained here. Furthermore, although most of the
examples are two-state predictions, the percentage numbers do not take into account random occurrences of the states. All test sets are most likely
biased (e.g., current 3D test sets do not contain many compositionally biased regions, which probably contain up to 15% of all residues, and annotation
test sets are far from being perfect; see text); i.e., the real accuracy is probably lower.

(a) To make the numbers more comparable, accuracy has been multiplied by coverage; some methods give accuracy for different degrees of coverage and
roughly justify this procedure. However, often it is biased toward sensitivity as specificity cannot be properly taken into account. Most features predicted
with an accuracy × coverage >0.70 are of structural nature and at best only indirectly imply a certain functionality.

(b) Only one recent reference is given and, if indicated, references therein should also be considered, as other reports do not always agree with the numbers
given.
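
The accuracy × coverage column of Table 1 can be recomputed from the two percentage columns. The following Python sketch is an illustration added for that purpose and is not code from the article: the function name acc_times_cov and the selection of rows are assumptions, and the input values are copied from Table 1.

```python
def acc_times_cov(accuracy_pct: float, coverage_pct: float) -> float:
    """Return accuracy x coverage as a fraction, rounded to two decimals."""
    return round((accuracy_pct / 100.0) * (coverage_pct / 100.0), 2)


# (accuracy in %, coverage in %) copied from a few rows of Table 1
rows = {
    "Human promoters": (50, 70),
    "Human genes (only presence)": (70, 70),
    "Human SNPs by EST comparison": (70, 30),
    "Human alternative splicing": (90, 50),
    "Coiled coil (only presence)": (90, 90),
}

for name, (acc, cov) in rows.items():
    print(f"{name}: {acc_times_cov(acc, cov):.2f}")
```

Because the product penalizes a method that is accurate only on a small part of the reference set, a prediction with 90% accuracy at 50% coverage (0.45) ranks below one with 70% accuracy at 70% coverage (0.49).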



translational modifications of gene 
products is increasing constantly, so 
that the complexity at the protein level 
is enormous. Each of these modifica- 
tions may change the function of the 
respective gene products drastically. 
(The entire aspect of context-dependent 
gene regulation is excluded from current 
discussions as we are only beginning to 
understand the complex underlying genetic
machinery. For example, promoter
prediction in eukaryotes has a success rate
of only ~35% (Table 1), and there are many
other regulatory elements that we cannot
predict at all.)
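
To put the correlation coefficient of 0.48 between gene expression and protein levels into perspective, it can be squared to give the fraction of variance that a linear relationship would account for. The sketch below assumes that the quoted value is a Pearson correlation coefficient, which the text does not state explicitly; the function name variance_explained is hypothetical.

```python
def variance_explained(r: float) -> float:
    """Coefficient of determination (r squared) for a Pearson correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r * r


r_expression_vs_protein = 0.48  # value quoted in the text
share = variance_explained(r_expression_vs_protein)
print(f"r = {r_expression_vs_protein}, r^2 = {share:.2f}")
```

Under that assumption, less than a quarter of the variation in protein abundance would be captured by mRNA abundance alone, which is why conclusions about protein products drawn from expression arrays should be treated with caution.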



Limitations Created by 
Third-Party Analyses 

Public releases of completely sequenced 
genomes exceed a rate of one per 
month, with thousands of function pre- 
dictions therein. Gene annotation via 
sequence database searches is already a 
routine job, but even here the error rate 
is considerable (Table 1). The lower limit 
of errors in current functional annota- 
tion of large-scale sequencing projects is 
8% (Brenner 1999). As errors accumulate 
and propagate (Bork and Bairoch 1996; 
Bhatia et al. 1997; Smith and Zhang
1997; Bork and Koonin 1998; Pennisi
1999), it becomes more difficult to infer
the correct function from the many
possibilities revealed by a database search.
Compounding these complications,
computer programs often cannot even
retrieve the source of the stored
information (Doerks et al. 1998).
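
How quickly propagated errors can erode annotation quality may be illustrated with a deliberately simple model. The sketch below assumes that an annotation is passed along a chain of database entries and that each transfer fails independently with the same probability; both the independence assumption and the function name fraction_correct are illustrative and not taken from the cited studies. The 8% lower limit quoted above is used only as an example per-step error rate.

```python
def fraction_correct(per_step_error: float, steps: int) -> float:
    """Expected fraction of annotations still correct after `steps` transfers,
    assuming independent errors with a constant per-step error rate."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must lie in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


for n in (1, 2, 5, 10):
    print(f"after {n:2d} transfer(s): {fraction_correct(0.08, n):.3f}")
```

Even under this optimistic per-step rate, fewer than half of the annotations would remain correct after ten successive transfers, which illustrates why keeping track of the source of each annotation matters.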

Use of Complementary Information to 
Limit Errors in Function Prediction 

Some new information can be retrieved
from completely sequenced genomes;
for example, function can be predicted
by exploiting genomic context.
Based on the observation that interacting
proteins in one organism sometimes
have homologs in other organisms fused 
together in a single gene, Marcotte et al. 
(1999a) predicted novel interactions for 
50% of yeast proteins using gene fusion 
information. However, they noted an 
overlap with classical methods and an 
error rate of 82%. To see a signal they 
had to correct for domains present in 
many proteins (Marcotte et al. 1999a). 
When only orthologs with fission and
fusion events are considered (Enright et al.
1999; Snel et al. 2000), the signal-to-noise
ratio increases and the number of
predictions drops dramatically (7% of
Escherichia coli proteins; Enright et al.
1999). With a particular question in
mind (e.g., Does protein X have interaction
partners?), the generation of hypotheses
is extremely useful; yet to provide a general
overview of protein function, it is
advisable to keep the errors small. Fur- 
ther information can be added later, 
which is easier than retracting stored in- 
formation. But how do we incorporate 
the information on error margins? Such
estimates (and sometimes even the
sources of the annotation) are not visible
in current databases that store the
results of computational approaches.

Taking the 70% Hurdle 

As noted above, most prediction 
schemes extrapolate from current 
knowledge, and many bioinformatics 
methods have difficulty exceeding a 
70% prediction accuracy (numbers in 
Table 1 are often overestimates because 
the test sets used are usually not repre- 
sentative of all sequences). On one 
hand, current methods seem to capture 
important features and explain general 
trends; on the other hand, 30% of the 
features are missing or predicted 
wrongly. This has to be kept in mind
when processing the results further. Also,
the 70% accuracy often applies to
methods that deal with discrete objects
such as sequences; making estimates 
about the prediction of cellular features 
is much more difficult as one first has to 
agree on semantics (or ontology in a da- 
tabase sense) to describe complex pro- 
cesses in a comparable way. 
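
One reason a raw accuracy of 70% can overstate a two-state method is noted under Table 1: the percentages do not account for how often each state occurs by chance. The sketch below rescales accuracy against a trivial predictor that always guesses the more frequent state. This rescaling is one common convention chosen here for illustration, not the procedure used for Table 1, and the function name skill_over_baseline and the example state frequencies are hypothetical.

```python
def skill_over_baseline(accuracy: float, majority_fraction: float) -> float:
    """Rescale accuracy so that 0 means 'no better than always predicting the
    more frequent state' and 1 means a perfect prediction."""
    if not 0.5 <= majority_fraction < 1.0:
        raise ValueError("majority_fraction must lie in [0.5, 1)")
    baseline = majority_fraction  # accuracy of the trivial predictor
    return (accuracy - baseline) / (1.0 - baseline)


# The same 70% accuracy under two hypothetical state distributions
for majority in (0.5, 0.7):
    skill = skill_over_baseline(0.70, majority)
    print(f"majority state {majority:.0%}: skill = {skill:.2f}")
```

With balanced states, 70% accuracy is a real gain over guessing; if one state already makes up 70% of the test set, the same figure carries no information beyond the trivial predictor.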

All of the above focuses on limita- 
tions in the computational prediction of 
qualitative features. There remains a 
long way to go until we are able to de- 
scribe molecular processes quantita- 
tively; current simulations of complex 
systems are still very rough and simplis- 
tic. However, there is still no doubt that 
sequence analysis is extremely powerful 
and that the generation of hypotheses 
derived by computational methods will 
be more and more often the first success- 
ful step in the design of experiments. If 
70% of such experiments were success- 
ful, the speed of scientific discoveries 
would grow exponentially. 

The publication costs of this article 
were defrayed in part by payment of 
page charges. This article must therefore 
be hereby marked "advertisement" in 
accordance with 18 USC section 1734 
solely to indicate this fact. 